ABSTRACT: As novel and drug-resistant bacterial strains
continue to present an emerging health threat, the development
of new antibacterial agents is critical. This includes
making improvements to existing antibacterial scaffolds as well
as identifying novel ones. The aim of this study is to apply a
Bayesian classification QSAR approach to rapidly screen
chemical libraries for compounds predicted to have antibacterial
activity. Toward this end, we assembled a data set of
317 known antibacterial compounds as well as a second data
set of diverse, well-validated, non-antibacterial compounds
from 215 PubChem Bioassays against various bacterial species. We
constructed a Bayesian classification model using structural
fingerprints and physicochemical property descriptors and achieved
an accuracy of 84% and a precision of 86% on an independent
test set in identifying antibacterial compounds. To demonstrate the
practical applicability of the model in virtual screening, we
screened an independent data set of ~200k compounds. The results
show that the model can screen top hits of PubChem
Bioassay actives with accuracy up to ~76%, representing a 1.5- to
2-fold enrichment. The top screened hits represented a mixture of
known antibacterial scaffolds and novel scaffolds. Our study
suggests that a well-validated Bayesian classification QSAR
approach could complement other screening approaches in identifying
novel and promising hits. The data sets used in
constructing and validating this model have been made publicly available.
■ INTRODUCTION

It is impossible to determine the number of bacterial infections
treated each year worldwide. According to the World Health
Organization (WHO), the top five infectious diseases with the
highest death rates are lower respiratory tract infections (3.9
million), diarrhea (1.8 million), tuberculosis (1.6 million),
pertussis (290 000), and tetanus (210 000).1 Considering this
staggering number of deaths, there should be a lucrative market
for drug therapies for these diseases. Indeed, this was true up
until the early 1990s, when around 20 pharmaceutical
companies were involved in antibacterial research. Today
only two are active.2 In the last 25 years, not a single novel
antibacterial drug class has been discovered. Though many
scientists consider the last three major classes introduced,
oxazolidinones (2000), lipopeptides (2003), and pleuromutilins
(2007), to be novel, they were, in fact, patented in 1978,3 1987,4
and 1952,5 respectively.

Multiple reasons have been cited for this drift from the
“golden age” (1945–1965) to the “innovation gap” (1987 and
onward) in the discovery of novel antibacterial compounds. Among
many, there are three main hurdles to success in this area. First
are “scientific difficulties” due to (i) rapid evolution of resistant
strains that renders even newly developed antibacterials
ineffective, (ii) lack of novel screening libraries and compounds
for new drug discovery, and (iii) difficult-to-manage side effects
due to the high doses required to achieve the blood levels
necessary for efficacy. Second is “pharmaceutical company
disinterest” due to (i) lack of financial gains because of the
short-duration treatment regimens typically prescribed for
antibacterials, (ii) difficulties in licensing, and (iii) the uncertain
future of drugs due to resistance. Last, but not least, are
“regulatory hurdles” due to (i) the Food and Drug
Administration’s delay in issuing guidance documents regarding
acceptable study designs and acceptable efficacy outcomes and
(ii) requirements for studies to sufficiently demonstrate the
superiority of new drugs over current treatment regimens, which
leads to costly and difficult clinical trials. Many excellent
reviews and articles in the literature highlight these
problems in detail.6–9

Despite all these difficulties, both scientific and otherwise,
there is a perpetual need for new antibacterials; as a result, the
antibacterial product pipeline has never been totally empty.
Many new antibacterials have been approved since 1970, and
almost all of them are improved versions of previously
known scaffold classes. Many of these improvements, some
very substantial, have yielded analogues with broader
antibacterial spectra to avoid resistance, lower toxicity, and
lower dose regimens. A good example of such antibacterial
evolution is the cephalosporins. This class is one of the most
commonly prescribed classes of antibacterials.10 Over the years,
they have constantly evolved, and each new generation (from I
to V), while retaining the cephem scaffold, is designed to have
an added spectrum of activity and/or to be active against those
bacteria that have become resistant to the previous
generation.11 Additional examples of new antibacterials, approved
since 2000, based on known scaffolds include doripenem and
ertapenem (carbapenems); tigecycline (tetracyclines);
telithromycin and fidaxomicin (macrolides); telavancin
(glycopeptides); gemifloxacin (quinolones); linezolid (oxazolidinones);
daptomycin (lipopeptides); and retapamulin (pleuromutilin).
Such modifications and incremental tailoring are not only
necessary to fight the resistant pathogens but can also be
used to maximize the therapeutic potential of each scaffold.
This shows that, along with research efforts to
discover novel scaffolds, we also need to devise ways to further
explore the known scaffold properties of current antibacterials.
This is especially true since it is well-known that the currently
known small molecule space is sparsely inhabited by
antibacterial-like compounds. Hence, strategies that explore
the available chemical space would help in finding new
antibacterials.

Structurally, antibacterials differ significantly from other drug
classes, such as drugs targeting human proteins.12 Most of these
differences are attributed to their need to penetrate and
persist in bacterial cells while avoiding human cells. The
differences, such as higher molecular weight and polarity, and
other physicochemical properties, have been exploited in a few
previous studies that utilize binary quantitative structure-
activity relationship (QSAR) classification models to distinguish
between antibacterial and non-antibacterial
compounds.14-20 These attempts include the use of techniques
such as linear discriminant analysis, binary logistic regression,
and artificial neural networks. In all these studies, the training
data sets of antibacterial and non-antibacterial compounds
range from 24 to 249 and 35 to 731 compounds, respectively, and all the
non-antibacterials were collected from the Merck Index of
compounds.21 Additionally, many QSAR models for “class
specific” antibacterials, such as fluoroquinolones,22 β-lactams,23
and aminoglycosides,24 have also been developed and appear to
be useful in screening potential hits.

In the current study, we have used a Bayesian classification
approach, previously unused for this task, to build a QSAR model that
can distinguish between antibacterial and non-antibacterial
compounds. Bayesian modeling is a well-known classification
approach, and many examples of its utility as a tool in drug
discovery and structure–activity analysis exist. Previously, it
has been used successfully in finding inhibitors of kinases,25
G-protein-coupled receptors (GPCRs),26 the γ-aminobutyric acid
type A (GABA-A) ionotropic receptor,27 and Mycobacterium
tuberculosis,28 and in identifying important structural
features/fragments for microsomal stability22 and human ether-a-go-go
related gene (hERG) protein blockers.30 Besides being
deceptively simple and robust, the Bayesian approach has another
major strength: its ability to rank molecules according
to their probability of being active. This ranking of molecules is
important when prioritizing molecules for screening, i.e.,
making focused libraries, or for further development.

In our study, we utilized structural fingerprints and selected
physicochemical properties of 317 known antibacterials to build
a Bayesian model. Our collection of antibacterials is significantly
larger than those used in previous antibacterial classification
studies. Using a novel strategy, an equal number of non-antibacterials
was also collected from inactive compounds in 215
bacterial bioassays deposited in the freely available PubChem
repository provided by the National Center for
Biotechnology Information (NCBI).31 This is different from
collecting non-antibacterials from the Merck Index of
compounds, which is not open source and where most of the
compounds are actually drugs targeting human proteins.
Ultimately,  since  the  goal  of  this  model  will  be  to  enrich  for 
antibacterial  compounds  from  compound  libraries  typically 
used  for  antibacterial  screening,  we  curated  a  representative  set 
of  well-validated  non-antibacterials  from  data  from  actual 
antibacterial  screening  studies  available  through  PubChem. 
The  developed  Bayesian  models  were  validated  using 
independent  test  set  molecules  that  were  not  used  to  train 
the  models.  This  allowed  us  to  more  accurately  estimate  the 
prediction  power  of  the  models.  As  mentioned  above,  since  a 
model  would  be  more  useful  if  the  model  results  could  be 
translated  into  practical  virtual  screening  strategies,  we  further 
validated  our  approach  by  successfully  filtering  out  active  hits 
from  ~200  000  screening  molecules  that  were  used  to  find 
inhibitors  for  various  bacterial  pathogens  and  deposited  in 
PubChem Bioassays. Ultimately, the main purpose of this
model is to make predictions, based on known antibacterials and
non-antibacterials, for unknown screening compounds in order
to identify the analogues that contain the most antibacterial-like
structural features and properties.

■  METHODS  AND  MATERIALS 

Workflow.  The  workflow  followed  for  our  classification 
QSAR  model  building,  its  validation,  and  its  use  in  virtual 
screening  is  shown  in  Figure  1.  All  the  data  sets  were  collected 
from  DrugBank,32  PubChem,  and  literature.  Our  Bayesian 
model  was  built  using  training  set  molecules,  whereas  validation 
was  done  on  separate  test  set  molecules.  Finally,  virtual 
screening  was  done  on  the  collected  set  of  actives  and  inactives 
from  PubChem  Bioassays  for  pathogens.  Each  step  of  this 
workflow  is  described  in  detail  in  the  following  subsections. 

Figure 1. Workflow for QSAR classification model building, validation,
and virtual screening (VS) as applied to antibacterial and
non-antibacterial data sets. The number of compounds is also shown for
some of the steps, in parentheses.

Data Set Collection. We collected a total of 317 known
antibacterials from the literature and the DrugBank database of
compounds. Structurally, these can be divided into nine classes
as shown in Figure 2. The biggest class is the β-lactam antibacterials,
which constitute roughly 1/3 of all the antibacterials and include
subclasses such as cephalosporin, penicillin, carbapenem,
monobactam, and oxacephem. After the β-lactams, quinolones,
sulfonamides, and macrolides constitute the next three largest
compound classes. Unlike collecting antibacterials, collecting
non-antibacterial compounds, i.e., compounds that are inactive against a
broad panel of bacterial species, is difficult, and no
straightforward approach or database is readily available for
this task. In our study, to collect a database of non-antibacterial
compounds, we used publicly available PubChem Bioassay
results as provided by the NCBI. A total of 215 different
bacterial bioassays were available in PubChem at the time of
this study. Out of these, only 30 bioassays screened 10 or more
compounds. From this subset of 30 bioassays, we selected all
350 000 unique inactive compounds. We further selected only
those compounds, a total of 190 477, that were found to be
inactive in at least 7 different bacterial bioassays. This
is still a huge collection of compounds compared to our
antibacterial data set of 317 compounds. Because a QSAR
classification model works best if the data set compounds are as
diverse as possible, we did a clustering analysis and, based on
Tanimoto coefficient values, selected the 10 000 most
diverse compounds from the pool of 190 477 non-antibacterials.
In this clustering task, molecular similarity was computed
as the Tanimoto distance between molecules using the
ECFP_6 fingerprint property (atom type-based extended
connectivity fingerprint).33 Cluster centers were selected by
maximum dissimilarity, i.e., picked from the diverse outer edges of
the clusters. Additionally, for better classification models, the
two data sets, the antibacterials and the diverse set of inactives,
should be as closely matched as possible so that only the best
discriminating features between the two sets can be collected.
To do this, we structure-matched the 10 000 compounds with
the 317 antibacterials and selected the same number of most
closely matched inactives to keep a ratio of 1:1 between the two
sets. We labeled this inactive set of 317 compounds as the
non-antibacterial data set.
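
As a minimal illustration of the diversity selection described above, the following Python sketch computes Tanimoto coefficients and picks a maximally dissimilar subset with a greedy MaxMin procedure. It is only a sketch under stated assumptions: fingerprints are represented as plain sets of integer feature identifiers standing in for ECFP_6 bits, the function names are hypothetical, and the greedy MaxMin picker is a commonly used stand-in rather than the clustering procedure actually run in this study.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fps: List[Set[int]], n_pick: int, first: int = 0) -> List[int]:
    """Greedy MaxMin selection (assumed stand-in for the clustering step):
    repeatedly add the compound whose nearest already-picked neighbour is
    farthest away, with distance defined as 1 - Tanimoto."""
    picked = [first]
    # Distance from every compound to its closest picked compound so far.
    nearest = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    for _ in range(min(n_pick, len(fps)) - 1):
        candidate = max(
            (i for i in range(len(fps)) if i not in picked),
            key=lambda i: nearest[i],
        )
        picked.append(candidate)
        for i, fp in enumerate(fps):
            nearest[i] = min(nearest[i], 1.0 - tanimoto(fp, fps[candidate]))
    return picked


# Toy fingerprints (hypothetical feature ids, not real ECFP_6 bits).
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
diverse = maxmin_pick(toy_fps, n_pick=3)
print(diverse)
```

In the toy example the picker starts from compound 0 and then jumps to a compound that shares no features with it, which mirrors the idea of sampling the outer edges of the chemical space before filling in the interior.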

Model Building. On the collected data sets of antibacterials
and non-antibacterials, we applied the Bayesian classification
approach, which is based on a learn-by-example protocol, as
implemented in Pipeline Pilot, version 8.0.3 The Bayesian
approach  is  a  robust  classification  approach  that  can  distinguish 
between  active  and  inactive  compound  sets.  Complete  details 
of  the  Bayesian  method  are  described  elsewhere,25  but  in  short 
the  technique  is  based  on  the  frequency  of  occurrence  of 
various  descriptors  that  are  found  in  two  or  more  sets  of 
molecules  that  discriminate  best  between  these  sets.  The  model 
learning  process  starts  by  generating  a  large  set  of  binary  (yes/ 
no)  features  from  the  input  set  of  descriptors,  structural  and/or 
physicochemical,  and  then  collects  the  frequency  of  occurrence 
of  each  feature  in  the  “good  (active)”  subset  and  among  the  “all 
data  set”  compounds.  To  apply  the  model  to  a  particular 
sample,  the  features  of  the  sample  are  generated,  and  a 
Laplacian  adjusted  weight  is  calculated  for  each  feature  based 
on  a  probability  estimate.  Finally,  the  weights  are  added  to 
create  a  weight  sum  which  provides  a  relative  predictor  of  the 
likelihood  of  that  sample  being  from  the  “good  (active)”  subset. 
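
The learning and scoring steps just described can be sketched in a few lines of Python. This is an illustrative sketch only, assuming one common formulation of the Laplacian-corrected estimator, in which the weight of a feature seen in n samples (a of them active) is log((a + 1)/(n × P(active) + 1)); the Pipeline Pilot implementation may differ in detail. The function names and the toy feature labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def train_laplacian_bayes(samples: List[Set[str]], labels: List[bool]) -> Dict[str, float]:
    """Return one Laplacian-corrected log weight per binary feature."""
    base_rate = sum(labels) / len(labels)  # P(active) over the whole training set
    n_with_feature: Counter = Counter()
    n_active_with_feature: Counter = Counter()
    for features, is_active in zip(samples, labels):
        for feature in features:
            n_with_feature[feature] += 1
            if is_active:
                n_active_with_feature[feature] += 1
    # The "+1" terms pull rarely seen features toward a weight of 0.
    return {
        feature: math.log(
            (n_active_with_feature[feature] + 1.0)
            / (n_with_feature[feature] * base_rate + 1.0)
        )
        for feature in n_with_feature
    }


def bayes_score(weights: Dict[str, float], features: Set[str]) -> float:
    """Sum the weights of the features present; unseen features contribute 0."""
    return sum(weights.get(feature, 0.0) for feature in features)


# Toy training set: two "good" (active) and two "bad" (inactive) samples.
toy_samples = [
    {"cephem", "amine"},
    {"cephem", "thiazole"},
    {"sulfonamide", "phenyl"},
    {"phenyl", "amine"},
]
toy_labels = [True, True, False, False]
weights = train_laplacian_bayes(toy_samples, toy_labels)
good_score = bayes_score(weights, {"cephem", "thiazole"})
bad_score = bayes_score(weights, {"phenyl", "sulfonamide"})
```

A feature that occurs equally often in both sets (here "amine") receives a weight of zero, so it does not move the score in either direction; only features enriched in one set contribute to the ranking.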

In our approach, we selected both structural descriptors,
namely molecular function class fingerprints of maximum diameter 6
(FCFP_6),33 and the physicochemical descriptors SlogP,
molecular weight, number of rotatable bonds, number of
rings, number of aromatic rings, number of hydrogen bond
acceptors, number of hydrogen bond donors, and molecular
polar surface area. All the physicochemical descriptors were
precalculated with Chemical Computing Group’s Molecular
Operating Environment (MOE), v. 2010.10.35 The compounds
were  divided  into  the  training  and  test  sets  by  randomly 
selecting  80%  of  the  antibacterials  and  non-antibacterials  for 
training  and  the  remaining  for  testing.  To  test  whether  the 
random  selection  of  compounds  for  training  and  testing  created 
a  bias,  we  repeated  the  selection  10  times  and  applied  the 
algorithm  to  each  set.  We  did  not  detect  any  significant 
difference  between  the  various  data  set  results.  Finally,  the 
model  was  built  using  the  training  data  set  of  compounds. 
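
The repeated random 80:20 division can be sketched as follows. This is a hypothetical illustration, not the procedure actually run in Pipeline Pilot: the compound identifiers are placeholders, the split is drawn separately within each class, and the seeds simply stand for the 10 independent repetitions.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    actives: Sequence[str],
    inactives: Sequence[str],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly split each class separately so both sets keep a 1:1 ratio."""
    rng = random.Random(seed)
    train: List[str] = []
    test: List[str] = []
    for group in (actives, inactives):
        shuffled = list(group)
        rng.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Placeholder identifiers for the 317 antibacterials and 317 non-antibacterials.
antibacterials = [f"AB_{i}" for i in range(317)]
non_antibacterials = [f"NAB_{i}" for i in range(317)]

# Ten independent random splits, one per seed.
splits = [stratified_split(antibacterials, non_antibacterials, seed=s) for s in range(10)]
train_size = len(splits[0][0])
test_size = len(splits[0][1])
```

With 317 compounds per class, an 80% cut yields 253 training and 64 test compounds per class, i.e., 506 and 128 compounds in total.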

Model Validation. The model validation was done using
leave-one-out cross-validation. In this technique, each compound
is left out one at a time, and the model built from the
remaining compounds is used to predict the left-out compound.
Once all the compounds pass through this cycle of prediction, a
receiver operating characteristic (ROC) plot is generated and
the area under the curve (AUC) is measured. Predictions were
made  for  both  the  training  set  and  test  set  compounds.  Table  1 
gives  the  definition  and  relationship  of  the  statistical  parameters 
calculated  to  determine  the  quality  of  the  model,  i.e.,  accuracy, 
sensitivity,  specificity,  precision,  and  kappa. 

For  classification  models,  the  kappa  value  is  considered  as 
true  accuracy,  because  the  agreement  by  chance  is  corrected  for 
and,  hence,  it  is  a  better  statistical  parameter  than  accuracy  to 
estimate  the  prediction  power  of  the  model.  A  model  is  often 
considered  useful  if  its  kappa  value  is  >0.4.29 
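
The statistical parameters defined in Table 1 can be computed directly from the four confusion-matrix counts. The short Python sketch below is an illustration with a hypothetical function name; the example counts are the training-set and test-set values reported in Table 3.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, precision, and kappa (Table 1)."""
    n = tp + tn + fp + fn
    # Number of agreements expected by chance, used in the kappa correction.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
        "kappa": ((tp + tn) - expected) / (n - expected),
    }


# Counts taken from Table 3.
training = classification_metrics(tp=226, tn=236, fp=17, fn=27)
test = classification_metrics(tp=51, tn=57, fp=8, fn=12)
for name, value in training.items():
    print(f"{name}: {value:.2f}")
```

Because kappa subtracts the agreement expected by chance from both the observed agreement and the total, it is always at or below the raw accuracy for a classifier that performs better than chance.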

Figure 2. Pie chart of 317 antibacterials from 9 different classes collected from DrugBank and literature. Percentage of total number of compounds
for each class is shown. [Values shown in the chart: β-lactam 33.4% (cephalosporin 19.1%, penicillin 10.7%, carbapenem 1.8%, monobactam 1.2%,
oxacephem 0.6%); miscellaneous 16.2% (pleuromutilin 1.5%, amphenicol 1.2%, diaminopyrimidine 1.2%, lincosamide 0.9%, streptogramin 0.6%,
steroid 0.3%, other 10.5%); quinolone 12.8%; macrolide 8.6%; sulfonamide 8.3%; peptide 7.6%; aminoglycoside 6.4%; tetracycline 5.2%;
oxazolidinone 1.5%.]


Table 1. Definition of Classification Model Performance
Measures between Predicted and Observed Parameters for
Two Data Sets, Antibacterial and Non-Antibacterial^a

                                 predicted
observed             antibacterial          non-antibacterial
antibacterial        true positive (TP)     false negative (FN)
non-antibacterial    false positive (FP)    true negative (TN)

^a Various statistical parameters can be calculated based on this
relationship. N (total) = TP + FP + FN + TN. Accuracy (proportion
of true predictions in the entire population) = (TP + TN)/N.
Sensitivity (ability to correctly predict positive results) = TP/(TP +
FN). Specificity (ability to correctly predict negative results) =
TN/(FP + TN). Precision (proportion of true positives among all
positive predictions) = TP/(TP + FP). Kappa = ((TP + TN) - E)/(N - E),
where E = ((TP + FN)(TP + FP) + (FP + TN)(FN + TN))/N.


Virtual Screening. To test how well the model performs in
a real virtual screening experiment, we prepared two data sets.
In one data set, we mixed 20 984 PubChem active compounds
and an equal number of inactive compounds randomly selected
from the pool of 190 477 inactive compounds collected from
PubChem Bioassays. This gives a ratio of 1:1 for actives versus
inactives. In the other set, we mixed the 20 984 actives with all the
190 477 compounds that were inactive in seven or more
PubChem Bioassays. This gives a ratio of ~1:9 for actives
versus inactives. It is fairly easy to collect and prepare data sets
with other ratios, but we feel these two ratios, 1:1 and 1:9 of
actives versus inactives, are sufficient to give an indication of the
quality of the model performance in virtual screening
experiments. Moreover, the 1:1 ratio mirrors the ratio used for
training, while the 1:9 ratio is more reflective of real virtual
screening cases, where the ratio of actives to inactives is
highly imbalanced in favor of inactives.
The model was tested on both data sets, and the number of
active compounds found in the top 10, 50, 100, 500, and 1 000
predicted compounds was calculated. The Bayesian model
screening was also compared with a
similarity-based screening method on the same two data sets.
The 2D similarity screening was carried out using Tanimoto
coefficients computed from the structural fingerprint
FCFP_6. Both data sets, with 1:1 and 1:9 ratios of
PubChem actives versus inactives, were screened for similarity
with the 317 antibacterials. As in the Bayesian model
assessment, from both data sets we extracted the top 10, 50,
100, 500, and 1 000 most similar hits (most similar to any of
the 317 antibacterials). Finally, the numbers of PubChem
actives in those sets, i.e., in the top 10, 50, 100, 500, and 1 000, were
calculated. The numbers of PubChem actives (enrichment) in
these sets were compared with the numbers of PubChem actives
obtained from the Bayesian screening sets.
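
The top-N counting used to compare the two screening methods can be sketched as follows. This is an illustrative Python sketch with hypothetical function names: scores may be Bayesian scores or, for the similarity screen, the maximum Tanimoto similarity to any reference antibacterial. The enrichment factor shown here, the hit rate in the top N divided by the fraction of actives in the whole set, is one common definition and is assumed rather than taken from the text.

```python
from typing import Dict, Sequence


def actives_in_top(
    scores: Sequence[float], is_active: Sequence[bool], cutoffs: Sequence[int]
) -> Dict[int, int]:
    """Rank compounds by descending score and count actives in each top-N list."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [is_active[i] for i in order]
    return {n: sum(ranked[:n]) for n in cutoffs}


def enrichment_factor(hits: int, n_top: int, n_actives: int, n_total: int) -> float:
    """Hit rate in the top-N list relative to the hit rate of random picking."""
    return (hits / n_top) / (n_actives / n_total)


# Toy screen: six compounds, three of them active.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_active = [True, True, False, True, False, False]
toy_hits = actives_in_top(toy_scores, toy_active, cutoffs=(2, 4))
toy_ef = enrichment_factor(toy_hits[2], 2, n_actives=3, n_total=6)

# Using the Table 5 counts for the 1:1 set: 758 actives in the top 1 000.
ef_top_1000 = enrichment_factor(758, 1000, n_actives=20974, n_total=2 * 20974)
```

For the 1:1 data set, 758 actives among the top 1 000 corresponds to a hit rate of about 76% against a random expectation of 50%, i.e., roughly a 1.5-fold enrichment.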

■  RESULTS  AND  DISCUSSION 

The  data  set  collection,  model  development,  and  validation 
procedure  described  in  this  study  provided  a  robust  and 
straightforward  approach  for  estimating  the  antibacterial-like 
probabilities  of  a  small  molecule.  This  includes  an  estimate  of 
the  accuracy  and  the  predictive  power  of  the  developed 
Bayesian classification model. Building such a model requires
one active antibacterial data set and one inactive
non-antibacterial data set.

Antibacterials and Nonantibacterials. Antibacterials
constitute a very heterogeneous set of compounds. They
occupy a unique physicochemical property space compared
with drugs targeting human proteins and with
compounds that are commonly found in screening libraries.12
In Table 2, we provide the mean values of nine physicochemical
descriptors, namely molecular weight (Wt), hydrogen bond acceptors
(HBA) and donors (HBD), number of nitrogen (nN) and
oxygen atoms (nO), number of rings (Rings), log of the
octanol/water partition coefficient (SlogP), topological polar
surface area (TPSA), and number of rotatable bonds (RB), together
with two violation counts using the Lipinski (LV)36 and Oprea (OV)37
rules, for the three sets of compounds classified as antibacterials,
non-antibacterials, and drugs targeting human proteins. These mean
values amply demonstrate the substantial differences between
these three compound categories.

Overall, antibacterials have roughly 50% higher weight, 60–130%
more acceptors and donors, 30–90% more nitrogen and
120% more oxygen atoms, 40% higher flexibility (RB), 150%
lower solubility, 90–120% higher polarity (TPSA), and 30%
more ring structures compared to the non-antibacterials and
drugs targeting human proteins. The antibacterial compounds,
on average, also violate at least one of Lipinski’s rule-of-5 criteria
and two of Oprea’s lead-like criteria for small molecules. Among the
antibacterials, the least drug-like classes are aminoglycosides,
peptides, and macrolides, with Oprea violation
counts of 4.8, 4.5, and 3.3, respectively. On the other hand, the
most drug-like are sulfonamides and quinolones, with
Oprea violation counts of only 0.1 and 0.3, respectively.


Table 2. Comparison of Average (Mean) Compound Property Values of Three Data Sets Representing Antibacterials,
Non-Antibacterials, and Drugs Targeting Human Proteins^a

                    average descriptor value
DB   class    N    Wt   HBA   HBD    RB  SlogP  TPSA  Rings    nN    nO   LV   OV
AB   AG      24   511   9.9   5.5   7.3   -9.3   271    3.2   4.8   9.9  2.5  4.8
AB   BL     104   450   4.3   1.7   7.8   -1.5   154    3.5   4.6   5.8  0.8  1.4
AB   ML      26   726  10.5   3.4   8.8    1.4   184    3.5   1.8  12.0  2.1  3.3
AB   OX       6   412   4.0   1.5   6.8    1.2   108    3.7   4.5   4.7  0.2  1.2
AB   PE      24  1225  14.1  12.0  25.7   -4.7   460    5.0  12.6  16.1  2.7  4.5
AB   QL      41   362   1.9   0.1   3.1   -0.3    88    3.9   3.0   3.5  0.0  0.3
AB   SL      27   289   3.2   1.9   3.9    1.4   103    2.1   3.6   2.9  0.1  0.1
AB   TC      17   503   4.3   4.1   4.2   -1.9   195    4.2   2.4   8.6  1.8  3.1
AB   MS      48   563   6.9   3.3   8.4    2.2   153    3.7   2.6   8.0  1.2  2.4
AB   all    317   530   6.0   3.1   8.1   -1.1   177    3.6   4.3   7.3  1.1  2.0
NAB         317   354   3.8   1.8   5.9    2.7    92    2.7   3.3   3.3  0.3  0.6
DHP         527   353   2.8   1.3   5.9    1.7    78    2.8   2.2   3.3  0.3  0.7

^a Antibacterials are further divided into nine classes, and their comparison is also shown. [Abbreviations: database (DB),
antibacterials (AB), non-antibacterials (NAB), drugs for human proteins (DHP), aminoglycosides (AG), β-lactams (BL),
macrolides (ML), oxazolidinones (OX), peptides (PE), quinolones (QL), sulfonamides (SL), tetracyclines (TC), and
miscellaneous (MS)].


Table 3. Statistical Outcome of the Performance of the
Bayesian Classifiers for the Training and Test Sets Molecules

parameters                 training    test
N                               506     128
good (antibacterials)           253      64
bad (non-antibacterials)        253      64
TP                              226      51
TN                              236      57
FP                               17       8
FN                               27      12
accuracy                       0.91    0.84
precision                      0.93    0.86
sensitivity                    0.89    0.81
specificity                    0.93    0.88
kappa                          0.83    0.69


Bayesian Model Development and Validation. To
exploit the differences in both structural and
physicochemical properties between antibacterial and
non-antibacterial compounds, we used a Bayesian classification
technique as implemented in Pipeline Pilot. In this classification
scheme, 317 antibacterial compounds are classified as “good”
samples and 317 non-antibacterial compounds as “bad”
samples. Here “good” and “bad” are arbitrary labels to
distinguish the two sets of compounds. The combined data sets
were further divided in an 80:20 ratio to make a training set
(506 compounds) and a test set (128 compounds). The
Bayesian model was built from the training set compounds,
using both the structural and the physicochemical property
parameters. The model was validated using a leave-one-out
cross-validation method, where one compound is removed from
the data set and its class, good or bad, is predicted using the
model derived from the rest of the data set compounds. An
accuracy of 91% was obtained with this model. The same
model was also used to predict the test data set of 128
compounds. For the test data set, an accuracy of 84% was
obtained. The rest of the statistical parameters are shown in
Table 3. The ROC plot and the enrichment plot for the test set
are shown in Figure 3. As expected, the model performed better
for the training set, but the outcome was still very good for the
independent test set, with a precision
of 86% and a kappa value of 0.69. Sensitivity and specificity
were 81% and 88%, respectively, for
the test set compounds.


Figure 3. ROC plot (left) showing area under the curve and enrichment plot (right) showing the percentage of top retrieved actives for test set
molecules.

Figure 4. Examples of the top 15 good (top) and bad (bottom) fragments estimated by Bayesian modeling. The Bayesian score (Score) is given for
each fragment. [Fragment structures are not reproduced here; the scores shown range from 0.522 to 0.556 for the good fragments and from
-2.780 to -1.935 for the bad fragments.]

Antibacterials:  Good  and  Bad  Fragments.  One  of  the 

advantages  of  using  a  Bayesian  classifier  based  on  structural 
fingerprints,  such  as  FCFP_6,  is  that  it  can  identify  important 
fragments  or  fingerprint  features  frequently  found  in  two 
classifying  groups.  From  a  total  of  7  232  FCFP_6  features  that 
we  used  in  making  the  model,  the  top  15  good  and  top  15  bad 


diverse  fragments,  favorable  and  unfavorable  for  the  anti¬ 
bacterial  classification,  are  shown  in  Figure  4.  [It  is  important  to 
note  that  these  30  fragments  by  no  means  represent  all  the 
antibacterial  structural  information,  since  it  is  a  very  small 
percentage  (<0.5%)  of  the  total  number  of  features  used  in 
building  the  model]. 

As expected, many of the top good features contain -NH3+/
-NH2+/-NH+, thiazole, penem, cephem, or quinolone fragments
that are common fragmental features of aminoglycosides,
peptides, β-lactams, and quinolones, which constitute a
majority of known antibacterials. Interestingly, in the top bad
fragments, many O=S(-N)=O containing fragments were
found, even though these are part of one of the most populated

Table 4. Statistical Outcome of the Performance of the
Bayesian Classifiers for the Training and Test Set Molecules
That Do Not Include Sulfa Compounds

(no sulfa compounds)
parameters                   training    test
total                        485         124
good (antibacterials)        232         59
bad (non-antibacterials)     253         65
true positives               228         51
true negatives               232         54
false positives              21          11
false negatives              4           8
accuracy                     0.95        0.85
precision                    0.92        0.82
sensitivity                  0.98        0.86
specificity                  0.92        0.83
kappa                        0.83        0.69

Table 5. Enrichment Results of Two Data Sets, for Two
Screening Methods^a

                                                                number of actives in
screening method             no. of actives  no. of inactives   top 10  top 50  top 100  top 500  top 1 000
Bayesian model screening     20 974          20 974             9       43      85       405      758
Bayesian model screening     20 974          190 159            9       26      45       155      276
similarity-based screening   20 974          20 974             10      33      55       231      439
similarity-based screening   20 974          190 159            9       27      30       70       115

^a In one data set, the ratio of actives versus inactives is 1:1 (20 974
actives and an equal number of inactives), and in the other it is 1:9 (20 974
actives and 190 159 inactives). The number of actives retrieved by
both methods in the top 10, 50, 100, 500, and 1 000 compounds is
shown.

class (sulfa) of antibacterials. The sulfonamide moiety was
selected as a bad fragment because of its common
occurrence in many non-antibacterial compounds. For example,
other than antibacterials, sulfa compounds are also used in
diuretics, anticonvulsants, and many dermatological drugs.
Hence, because of their wide occurrence in antibacterials,
non-antibacterials, and general screening libraries, these
fragments are given a bad Bayesian score, as they cannot be used
as distinguishing features between the antibacterial and
non-antibacterial compounds. [To provide an estimate of
sulfonamide scaffold popularity, we calculated that out of
~62 million compounds available in the ChemNavigator
database (ref 38) of purchasable compounds, 7.6 million or 12.5%
of all compounds were sulfonamides].
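
The effect described above, where a fragment that is common in both classes earns no favorable score, can be illustrated with a small sketch. It assumes the Laplacian-corrected estimator that is commonly described for Bayesian fingerprint models; the text above does not state the scoring formula, so the function `laplacian_score` and all counts below are hypothetical illustrations, not values from this study.

```python
import math


def laplacian_score(n_active_with_feature, n_total_with_feature, base_rate):
    """Score one fingerprint feature with a Laplacian-corrected estimate.

    ASSUMPTION: this is the commonly described Laplacian-corrected form,
    log((A + 1) / (T * P + 1)), not a formula quoted from the text.
    A = actives containing the feature, T = all samples containing it,
    P = overall fraction of actives in the training set.
    """
    corrected = (n_active_with_feature + 1.0) / (n_total_with_feature * base_rate + 1.0)
    return math.log(corrected)


# Hypothetical counts for a balanced training set (base rate of actives = 0.5).
base_rate = 0.5
examples = {
    "mostly in actives": (40, 42),    # 40 of 42 occurrences are actives
    "split evenly": (30, 60),         # no discriminating power
    "mostly in inactives": (5, 60),   # e.g. a fragment common in non-antibacterials
}
scores = {name: laplacian_score(a, t, base_rate) for name, (a, t) in examples.items()}
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

Under this assumed form, a feature seen equally often in both classes scores zero, and one that is seen more often among non-antibacterials scores negative, regardless of how many known antibacterials also contain it.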

In model development, if we exclude the whole sulfa class of
antibacterials from the training set compounds and follow
exactly the same steps as in the previous model building, the
resulting model behaves almost like a perfect model. The ROC-AUC
value of such a model is 0.99. The sulfa-excluding Bayesian
model gives an accuracy of 95%, a precision of 92%, a
sensitivity of 98%, and a specificity of 92% in the
classification of the training set compounds. For the test set,
the accuracy and precision of this model were 85% and 82%, while
the sensitivity and specificity were 86% and 83%. The complete
set of parameters is provided in Table 4.
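
As a check on how these statistics relate to the confusion-matrix counts, the following minimal sketch recomputes them for the sulfa-excluded test set from the counts reported in Table 4 (51 true positives, 54 true negatives, 11 false positives, 8 false negatives). It assumes that "kappa" refers to Cohen's kappa, which the text does not state explicitly; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary classification statistics from raw counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # ASSUMPTION: kappa is Cohen's kappa, i.e. observed agreement corrected
    # for the agreement expected by chance from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Test set counts for the sulfa-excluded model, as reported in Table 4.
test_set = classification_metrics(tp=51, tn=54, fp=11, fn=8)
for name, value in test_set.items():
    print(f"{name}: {value:.2f}")
```

With these counts the sketch gives 0.85, 0.82, 0.86, 0.83, and 0.69, matching the test column of Table 4.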

Antibacterial Bayesian Model in Virtual Screening. In
a study like this, the primary objective of the in silico screen is to
determine whether the model can distinguish and classify
unknown structures as good or bad. This is a common situation
in drug discovery where one wants to retrieve active analogues
from screening databases based on initial leads. Therefore, to
validate the predictive power of our Bayesian model in a real
test case scenario, we again used compounds from the
PubChem collection of bacterial bioassays. A total of 215
different bacterial assays were selected from PubChem. Any
bioassay that screened fewer than 10 compounds was excluded.
From the remaining 30 bioassays, we collected 350 000
screened compounds. From this data set, we further selected
a total of 190 477 compounds that were found to be inactive in
at least 7 different bacterial assays. From this set, the
317 compounds that were used in developing the model as
inactives (non-antibacterials) were removed. Finally, we had
190 159 compounds as inactives. From the same 215 assays, we
also collected all the compounds that were flagged as active.
Any known antibacterial tested in this set was also removed.
This gave us 20 974 active compounds. This data set of inactive
and active molecules is completely independent of the data
set used to build or test the Bayesian model. This
data set was further divided into two subsets before model
evaluation. This was done because the outcomes of high-throughput
screening assays are highly imbalanced between


[Figure 5 plot; x-axis of both panels: Fraction of Samples]

Figure 5. Enrichment plot showing the percentage of top retrieved actives in 0.5% of screened database for two virtual screening data sets selected
from PubChem Bioassays for various bacterial species.

[Figure 6 structure labels; CID and AID were not recovered for the first three entries]
Species: M. tuberculosis; Class: Aminoglycoside
Species: B. subtilis; Class: Rifamycin-like
Species: S. aureus; Class: Anthracycline
CID: 5482874; AID: 1850; Species: S. typhi; Class: Rifamycin-like
CID: 24868269; AID: 1490; Species: B. subtilis; Class: Penicillin
CID: 3794850; AID: 1902; Species: S. pyogenes; Class: Quinolone
CID: 16406961; AID: 588726; Species: M. tuberculosis; Class: Novel
CID: 11949015; AID: 1490; Species: B. subtilis; Class: Novel
CID: 4414227; AID: 1490; Species: B. subtilis; Class: Novel
CID: 16758507; AID: 1490; Species: B. subtilis; Class: Novel
CID: 1945566; AID: 588549; Species: M. tuberculosis; Class: Novel
CID: 9799192; AID: 1490; Species: B. subtilis; Class: Novel


Figure  6.  Structures  of  some  of  the  top  retrieved  hits  by  the  Bayesian  model  in  virtual  screening  of  the  PubChem  database.  Each  of  these  was 
experimentally  found  to  be  active  in  different  PubChem  bioassays.  The  PubChem  identity  number  (CID),  bioassay  identity  number  (AID),  species 
screened,  and  classification  of  antibacterial  scaffold  are  provided  for  each  hit. 


actives and inactives, in favor of inactives. For example, among
PubChem Bioassays, in most cases, the experimental hit rate
does not exceed 0.5% (ref 39). This imbalance poses a significant
problem for classification models because models that correctly
predict the same fraction of objects in each class will have
different objective function values. Hence, we developed two
independent data sets: in the first, we kept the ratio of actives
to inactives at 1:1, and in the second, we kept the ratio at 1:9.
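
The selection steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: the record layout, the helper names `select_inactives` and `make_subset`, and the use of random sampling to reach a given ratio are all hypothetical, since the text does not say how the 1:1 subset of inactives was drawn.

```python
import random


def select_inactives(assay_results, training_ids, min_inactive_assays=7):
    """Return compounds inactive in at least `min_inactive_assays` distinct assays.

    assay_results: iterable of (compound_id, assay_id, outcome) tuples,
    a hypothetical flattened form of the bioassay records.
    Compounds already used to train the model are dropped.
    """
    inactive_in = {}
    for compound_id, assay_id, outcome in assay_results:
        if outcome == "inactive":
            inactive_in.setdefault(compound_id, set()).add(assay_id)
    return sorted(
        cid for cid, assays in inactive_in.items()
        if len(assays) >= min_inactive_assays and cid not in training_ids
    )


def make_subset(actives, inactives, inactives_per_active, seed=0):
    """Pair all actives with a random sample of inactives at the given ratio.

    ASSUMPTION: random sampling is used here only for illustration.
    """
    wanted = min(len(inactives), inactives_per_active * len(actives))
    rng = random.Random(seed)
    return list(actives), rng.sample(list(inactives), wanted)


# Toy records: c1 is inactive in 7 assays, c2 in only 6,
# and c3 in 8 but was part of the training set.
records = (
    [("c1", aid, "inactive") for aid in range(7)]
    + [("c2", aid, "inactive") for aid in range(6)]
    + [("c3", aid, "inactive") for aid in range(8)]
)
kept = select_inactives(records, training_ids={"c3"})
print(kept)  # ['c1']

actives = ["a1", "a2"]
inactive_pool = [f"i{k}" for k in range(30)]
_, balanced = make_subset(actives, inactive_pool, inactives_per_active=1)
_, skewed = make_subset(actives, inactive_pool, inactives_per_active=9)
print(len(balanced), len(skewed))  # 2 18
```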


Next, we performed another test to compare our model
results with those of 2D similarity-based
screening. Among the ligand-based screening methods,
2D similarity-based screening is one of the most popular.
This is not only because of the 2D method's
computational efficiency but also because of its demonstrated
effectiveness in multiple studies (refs 1, 40-46). The 2D similarity
screening was carried out using Tanimoto coefficients
computed from the structural fingerprint (FCFP_6). Both
data sets, with 1:1 and 1:9 ratios of PubChem actives to
inactives, were screened for similarity with the 317 antibacterials.
The top 1 000 most similar hits for both data sets were
extracted, and the number of actives in each set was
counted.
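
A minimal sketch of Tanimoto-based ranking is given below. Fingerprints are represented as plain sets of integer feature identifiers, which are hypothetical stand-ins for FCFP_6 features. Ranking each library compound by its maximum similarity to any reference compound is an assumption; the text does not say how similarity to the 317 antibacterials was aggregated.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_max_similarity(library, references, top_n):
    """Rank library compounds by their best similarity to any reference.

    ASSUMPTION: maximum similarity is used as the ranking score.
    library: dict mapping compound id -> set of feature ids.
    references: list of feature-id sets for the known antibacterials.
    """
    scored = [
        (max(tanimoto(fp, ref) for ref in references), cid)
        for cid, fp in library.items()
    ]
    # Highest similarity first; ties are broken by compound id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(cid, sim) for sim, cid in scored[:top_n]]


# Hypothetical feature ids, not real FCFP_6 bits.
references = [{1, 2, 3, 4}, {10, 11, 12}]
library = {
    "cmpd_A": {1, 2, 3, 5},   # shares 3 of 5 features with the first reference
    "cmpd_B": {10, 11, 12},   # identical to the second reference
    "cmpd_C": {20, 21},       # shares nothing with either reference
}
top_hits = rank_by_max_similarity(library, references, top_n=2)
print(top_hits)  # [('cmpd_B', 1.0), ('cmpd_A', 0.6)]
```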

The results of both the Bayesian model screening and the 2D
similarity-based screening, for both data sets, are shown in
Table 5. Since the hit rate in actual high-throughput screening
does not exceed 0.5%, we only show the actives extracted from
the top 1 000 hits. In the first virtual screen of equal actives and
inactives (1:1), the compounds extracted using the Bayesian
model contained 90%, 85%, and ~76% actives in the top 10,
100, and 1 000 hits, respectively. For the second set, with a 1:9
ratio of actives to inactives, compounds extracted using the
Bayesian model contained 90%, 45%, and 27.6% actives in
the top 10, 100, and 1 000 hits, respectively. Compared to the
actual high-throughput screen outcome for actives in PubChem
Bioassays, there is a significant improvement in extracted hits in
Bayesian model screening. For the model, the top 0.5% results
are also shown as enrichment plots for both the 1:1
and 1:9 subsets (Figure 5). In
comparison, the early enrichment of the 2D similarity method
was comparable to that of the Bayesian method. In the top 10 and
top 50 hits, the similarity method retrieved 100% and 66% actives
in the 1:1 data set and 90% and 54% actives in
the 1:9 data set. In later enrichment (the top 100 to top 1 000
hits), once all the antibacterial-like structures were exhausted in the
similarity screen, the method performed no better than random
sampling of hits (Table 5).
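
To make the comparison with random sampling concrete, the sketch below converts the top-1 000 counts reported in Table 5 into hit rates and enrichment factors. The enrichment factor used here is the usual ratio of the hit rate in the selection to the overall fraction of actives in the screened set; the paper does not give its own formula, so this definition and the helper name `enrichment_factor` are assumptions.

```python
def enrichment_factor(actives_in_top, top_n, n_actives, n_inactives):
    """Hit rate among the top-ranked compounds divided by the base rate."""
    hit_rate = actives_in_top / top_n
    base_rate = n_actives / (n_actives + n_inactives)
    return hit_rate / base_rate


# (actives in top 1 000, total actives, total inactives), taken from Table 5.
top_1000 = {
    ("Bayesian", "1:1"): (758, 20974, 20974),
    ("Bayesian", "1:9"): (276, 20974, 190159),
    ("similarity", "1:1"): (439, 20974, 20974),
    ("similarity", "1:9"): (115, 20974, 190159),
}

factors = {}
for (method, ratio), (hits, n_act, n_inact) in top_1000.items():
    factors[(method, ratio)] = enrichment_factor(hits, 1000, n_act, n_inact)
    print(f"{method:10s} {ratio}: hit rate {hits / 1000:.1%}, "
          f"enrichment {factors[(method, ratio)]:.2f}")
```

An enrichment factor near 1 means the ranking does about as well as picking compounds at random, which is the case for the similarity search at the top-1 000 cutoff under this definition.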

Another significant advantage of the Bayesian model was
evident from the nature of the hits themselves. The model
output was not limited to existing scaffolds,
i.e., those similar to the molecules that were used in training the
model, but also included novel scaffolds. Figure 6 shows the
structures of some of the top screening hits of the Bayesian model
that are experimentally active in PubChem Bioassays. The
ability to discover novel antibacterial scaffolds is inherent in the
Bayesian model formulation and should represent an attractive
way to discover novel drug designs. In comparison, the 2D
similarity search output is limited to molecules that
are similar to known antibacterial scaffolds.

■  CONCLUSIONS 

Despite significant advances both in understanding the biology
and in the techniques available, antibacterial drug discovery is still
an arduous task. In the last couple of years, several in silico
methods have emerged as important drug-discovery tools.
Currently, only a few studies have described the use of
property-based in silico classification models for antibacterial
activity. Most of these published models show good to
acceptable discrimination between antibacterial and
non-antibacterial compounds.

Our study differs from these previous studies in a number of
ways. First, our collection of antibacterials is large. In previous in
silico studies, the antibacterial collections ranged from 24 to 249
compounds. We have collected 317 antibacterials of nine
different classes from DrugBank and extensive literature
searches and have made them publicly available (Supporting
Information). This is one of the largest reported and
characterized data sets of antibacterials. Such a collection is
important since the performance of in silico classification
models, such as Bayesian models, heavily depends on the number and
diversity of input training molecules. Second, no previous study


has effectively described how to collect "non-antibacterial"
compounds. This is mainly because most
studies tend to describe only the positive results, i.e., the
compounds that turned out to be active in bacterial assays.
More importantly, even when data on inactive compounds are
published, they remain focused on only one or a few
selected species of bacteria. For a non-antibacterial data set,
the ideal compounds would be those that show
inactivity against a panel of different species of bacteria. Our
study is unique since we have collected the inactive compounds
from the results of 215 PubChem Bioassays that were screened against a
wide panel of bacterial species. Third, the Bayesian classification
model described in this study has performed exceptionally well.
The model correctly classified 51 of the 64 actives in an
independent test set, showing an overall accuracy of 84%
and precision of 86%. Fourth, the model was subjected to an
actual virtual screen test case of extracting high-throughput
actives from PubChem bacterial bioassays. The same
virtual screening test case was also run with a 2D
similarity search method for comparison. With the Bayesian model, 75.8%
of the top 1 000 extracted hits were actives in a scenario
where actives and inactives were mixed in a 1:1 ratio. In a more
stringent test case, where actives and inactives were mixed in a
1:9 ratio, 27.6% of the top 1 000 screened compounds were
actives. In comparison, with the 2D similarity search
only 43.9% (1:1 ratio of actives to inactives) and
11.5% (1:9 ratio of actives to inactives) of
the top 1 000 screened compounds were actives, which is no better than
random sampling. Moreover, while the top actives retrieved by
2D similarity search were all from previously known scaffold
classes, Bayesian model screening hits were well populated with
both novel scaffolds and previously known scaffolds.

Overall, the Bayesian classification model is a robust method
that permits quick in silico discovery of novel antibacterial
candidates with a minimum of resources, and it may
be used as an efficient alternative to high-throughput screening
of antibacterial agents.

■  ASSOCIATED  CONTENT 
Supporting Information

Four worksheets: (1) antibacterial statistics, showing the
statistical details (average and standard deviation) of nine
descriptors used in the study for all the data sets, i.e.,
antibacterials, non-antibacterials, and drugs for human proteins;
the t-test comparison between the antibacterial and
non-antibacterial data sets is also shown; (2) antibacterial data,
listing the 317 antibacterials used in the study; (3) non-antibacterial
data, listing the 317 non-antibacterials used in the study; (4) human
drugs data, listing the 527 human drugs used in the study. This
material is available free of charge via the Internet at
http://pubs.acs.org.